Abstract

The problem of finding the optimal set of source 
nodes in a diffusion network that maximizes the 
spread of information, influence, and diseases 
in a limited amount of time depends dramati- 
cally on the underlying temporal dynamics of 
the network. However, this still remains largely 
unexplored to date. To this end, given a net- 
work and its temporal dynamics, we first des- 
cribe how continuous time Markov chains allow 
us to analytically compute the average total num- 
ber of nodes reached by a diffusion process star- 
ting in a set of source nodes. We then show 
that selecting the set of most influential source 
nodes in the continuous time influence maxi- 
mization problem is NP-hard and develop an 
efficient approximation algorithm with provable 
near-optimal performance. Experiments on syn- 
thetic and real diffusion networks show that our 
algorithm outperforms other state of the art algorithms
by at least ~20% and is robust across
different network topologies. 



1. Introduction 

In recent years, there has been an increasing effort in
uncovering, understanding, and controlling diffusion and
propagation processes in a broad range of domains: information
propagation (Leskovec et al., 2007), social networks (Kempe et
al., 2003), viral marketing (Richardson & Domingos, 2002), and
epidemiology (Wallinga & Teunis, 2004). Diffusion networks have
raised many research problems, ranging from network inference
(Gomez-Rodriguez et al., 2010; 2011) to influence spread
maximization (Kempe et al., 2003). In this article, we pay
attention to the latter problem, and we propose a method for
continuous time influence maximization that accounts for the
temporal dynamics of diffusion networks.



Influence spread maximization tackles the problem of se- 
lecting the most influential source node set of a given size 
in a diffusion network. A diffusion process that starts in 
such an influential set of nodes is expected to reach the 
greatest number of nodes in the network. In information 
propagation, the problem reduces to choosing the set of 
blogs and news media sites that together are expected to 
spread a piece of news to the greatest number of sites. 
In viral marketing, it consists of identifying the most in- 
fluential set of trendsetters that together may influence the 
greatest number of customers. Finally, in epidemiology, the 
influence maximization problem reduces to finding the set 
of individuals that together are most likely to spread an ill- 
ness or virus to the greatest percentage of the population. In 
this latter case, the solution of the influence maximization 
problem helps towards developing vaccination and quaran- 
tine policies. 

In our work, we build on the fully continuous time model of
diffusion recently introduced by Gomez-Rodriguez et al. (2011).
This model accounts for temporally heterogeneous interactions
within a diffusion network - it allows information (or influence)
to spread at different rates across different edges, as shown in
real-world examples. We first describe how, given a set of source
nodes, we can compute the average total number of infected nodes
analytically using the work of Kulkarni (1986). The key
observation is that the infection time of a node in a network with
stochastic edge lengths is the length of the stochastic shortest
path from the source nodes to the node. Later, we show that
finding the optimal influential set of source nodes in the
continuous time influence maximization problem is NP-hard. We then
provide an approximation algorithm that finds a suboptimal set of
source nodes with provable guarantees in terms of the average
total number of infected nodes.



Appearing in Proceedings of the 29th International Conference
on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 
2012 by the author(s)/owner(s). 



Related work. Richardson & Domingos (2002) were the first to study
influence maximization as an algorithmic problem, motivated by
marketing applications. In their work, they proposed heuristics
for choosing a set of influential customers with a large overall
effect on a network, and methods to infer the influence of each
customer were developed. Kempe et al. (2003) posed influence
maximization in a social network as a discrete optimization
problem. They showed that finding the optimal solution is NP-hard
for several models of influence, and obtained the first provable
approximation guarantees for efficient algorithms based on a
natural diminishing returns property of the problem,
submodularity. Since then there have been substantial developments
that build on their seminal work. Efficient influence maximization
that uses heuristics to speed up the optimization problem has been
proposed (Chen et al., 2009; 2010) and influence maximization has
been studied in the context of competing cascades (Bharathi et
al., 2007) or under additional constraints (Goyal et al., 2010).



However, to the best of our knowledge, previous work on 
influence maximization has ignored the underlying tempo- 
ral dynamics governing diffusion networks - once a trans- 
mission occurs, it always occurs at the same rate or tem- 
poral scale. In contrast, we consider heterogeneous pair- 
wise transmission rates, found in many real-world exam- 
ples. In information propagation, news media sites and pro- 
fessional bloggers typically report news faster than people 
that maintain personal blogs. In epidemiology, people meet 
each other with different frequencies and then the pairwise 
transmission rates between individuals within a population 
differ. Finally, in viral marketing, some customers make up 
their minds about a product or service quicker than others, 
and then pass recommendations on at different rates. 

The main contribution of our work is twofold. First, it 
considers a novel continuous time formulation of the in- 
fluence maximization problem in which information or in- 
fluence can spread at different rates across different edges, 
as in real-world examples. Second, this continuous time 
approach allows us to analytically compute and efficiently 
optimize the influence (i.e., average total number of
infections), avoiding the use of heuristics (Chen et al., 2009;
2010) or Monte Carlo simulations (Kempe et al., 2003).



2. Problem formulation 

In this section, we build on the fully continuous time model of
diffusion recently proposed by Gomez-Rodriguez et al. (2011). We
start by describing how the diffusion model accounts for pairwise
interactions and then continue by discussing some basic
assumptions about diffusion processes. We conclude with a
statement of the continuous time influence maximization problem.

Pairwise transmission likelihood. In a diffusion network, 
we first need to model the pairwise interactions between 
nodes. We pay attention to the general case in which di- 
fferent pairwise interactions between nodes in the network 
occur at different rates. Define $f(t_j \mid t_i; \alpha_{i,j})$
as the conditional likelihood of transmission between a node i and
a node j, where $t_i$ and $t_j$ are infection times and
$\alpha_{i,j}$ is the transmission rate. We assume that the
likelihood depends on the pairwise transmission rate
$\alpha_{i,j}$ and the time difference $(t_j - t_i)$ (i.e., it is
time shift invariant). Moreover, a node cannot be infected by a
node infected later in time (i.e., $t_j > t_i$) and as
$\alpha_{i,j} \to 0$, the expected transmission time becomes
arbitrarily long.

In the remainder of the paper, we consider the exponential
distribution $f(t_j \mid t_i; \alpha_{i,j}) \propto
e^{-\alpha_{i,j}(t_j - t_i)}$ to model pairwise interactions for
the sake of simplicity. The exponential model is a well-known
parametric model for modeling diffusion and influence in social
and information networks (Gomez-Rodriguez et al., 2010). However,
our results can easily be extended to diffusion networks with
phase-type pairwise transmission likelihoods. This is important
since the set of phase-type distributions is dense in the field of
all positive-valued distributions and it can be used to
approximate power-laws, which have also been used for modeling
diffusion in social networks (Gomez-Rodriguez & Schölkopf, 2012),
Rayleigh distributions, which have been used in epidemiology
(Wallinga & Teunis, 2004), and also subprobability distributions,
which enable us to describe two-step traditional diffusion models
(Kempe et al., 2003), in which with probability $(1 - \beta)$ an
infection may never occur.
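As an illustration of the exponential pairwise model, the following minimal sketch evaluates the normalized density and its cumulative form and draws transmission delays. The function names are hypothetical and chosen only for this example; the normalization constant alpha is the standard one for an exponential distribution.

```python
import math
import random

def transmission_density(t_j, t_i, alpha):
    """Exponential likelihood alpha * exp(-alpha * (t_j - t_i)) for t_j > t_i."""
    dt = t_j - t_i
    if dt <= 0 or alpha <= 0:
        return 0.0  # a node cannot be infected by a node infected later in time
    return alpha * math.exp(-alpha * dt)

def transmission_cdf(dt, alpha):
    """Probability that the transmission has occurred within a delay dt."""
    if dt <= 0 or alpha <= 0:
        return 0.0
    return 1.0 - math.exp(-alpha * dt)

def sample_delay(alpha, rng=random):
    """Draw one transmission delay t_j - t_i; its mean is 1 / alpha."""
    return rng.expovariate(alpha)
```

As the rate alpha approaches zero, the mean delay 1 / alpha grows without bound, which matches the assumption stated above about arbitrarily long expected transmission times.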



Continuous time diffusion process. We consider diffusion and
propagation processes that occur over static networks with known
(or inferred) connectivity and transmission rates. A diffusion
process starts when a source node set A becomes infected at time
t = 0 by action of an external source to the network. Then, source
nodes try to infect their children (i.e., neighbors that they can
reach directly through an outgoing edge). Once a child i gets
infected at time $t_i$, it tries to infect its own children, and
so on. For some pairwise transmission likelihoods, it may happen
that $t_i \to \infty$ and child i is never infected. Here, we
assume that a node i becomes infected as soon as one of its
parents (i.e., neighbors that are able to reach node i through an
outgoing edge) infects it, and later infections by other parents
do not contribute anymore towards the evolution of the diffusion
process. As a consequence of this assumption, at any time
$t \ge 0$ there may be some nodes and edges in the network that
are useless for the spread of the information (be it in the form
of a meme, a sales decision or a virus) towards a specific node n.
If these nodes get infected and transmit the information to other
nodes, this information can only reach n through previously
infected nodes. Therefore, the infection time $t_n$ of node n does
not depend on these nodes.
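The process described above can be simulated directly: since a node is infected by its fastest parent, its infection time is the length of a shortest path whose edge lengths are random delays. The sketch below is an illustrative simulation under exponential delays, not the analytical method of this paper; the function name and the dict-based edge format are assumptions made for the example.

```python
import heapq
import random

def sample_infection_times(rates, sources, rng):
    """Simulate one diffusion; return {node: infection time} for reached nodes.

    rates: dict mapping a directed edge (i, j) to its transmission rate.
    sources: iterable of source nodes, all infected at time t = 0.
    """
    children = {}
    for (i, j), alpha in rates.items():
        if alpha > 0:
            children.setdefault(i, []).append((j, alpha))
    times = {}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        t, node = heapq.heappop(heap)
        if node in times:
            continue  # already infected earlier by a faster parent
        times[node] = t
        for child, alpha in children.get(node, []):
            if child not in times:
                # each edge delay is drawn once, when its parent gets infected
                heapq.heappush(heap, (t + rng.expovariate(alpha), child))
    return times
```

Nodes that no source can reach never appear in the returned dictionary, which corresponds to an infinite infection time.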

Finally, given a diffusion process that started in the set of
source nodes A, we define N(A; T) as the number of nodes infected
up to time T and then define the influence function
$\sigma(A; T)$ as the average total number of nodes infected up to
time T, i.e., $\sigma(A; T) = \mathbb{E}\, N(A; T)$.

Figure 1. Panels (a,b): Sets of infected nodes ($I$; in red) and
useless nodes ($U_n$; in orange) at two different times for a
diffusion process that starts in the source node set A = {3, 5}
relative to a particular sink node (n; in black). Any path from a
useless node to the sink node is blocked by an infected node. The
set of disabled nodes ($X_n$) is simply the union of the sets of
infected and useless nodes. Panel (c): Sets of disabled nodes
$X \in \Omega^*_n$ such that $A \subseteq X$. They represent the
states that we need to describe the temporal evolution of a
diffusion process towards the sink node n that starts in the set
of sources A.

Continuous time influence maximization problem. Our goal is to
find the set of source nodes A in a diffusion network G that
maximizes the influence function $\sigma(A; T)$. In other words,
we seek the set of source nodes A such that a diffusion process in
G reaches, on average, the greatest number of nodes before a
window cut-off T. Thus, we aim to solve:

$$A^{*} = \operatorname*{argmax}_{|A| \le k} \sigma(A; T), \qquad (1)$$

where the source set A is the variable to optimize and the time
horizon T and the source set cardinality k are constants.
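For very small networks, Eq. (1) can be solved by enumerating every candidate source set. The sketch below assumes a hypothetical callable `influence` that returns the influence of a set of nodes for a fixed time horizon; how that value is computed is left open here. The number of subsets grows combinatorially with k, so this is only a reference baseline.

```python
from itertools import combinations

def exhaustive_search(candidates, influence, k):
    """Return (best set, best value) over all source sets of size at most k.

    influence: callable mapping a frozenset of nodes to its influence.
    """
    best_set = frozenset()
    best_value = influence(best_set)
    for size in range(1, k + 1):
        for subset in combinations(candidates, size):
            value = influence(frozenset(subset))
            if value > best_value:
                best_set, best_value = frozenset(subset), value
    return best_set, best_value
```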

3. Proposed algorithm 

We start this section by describing how to evaluate the influence
function $\sigma(A; T)$ for any set of sources A in a network G
using the work of Kulkarni (1986). The key observation is that the
infection time of a node in a network with stochastic edge lengths
is the length of the stochastic shortest path from the source
nodes to the node. Then, we show that the continuous time
influence maximization problem defined by Eq. (1) is NP-hard.
Finally, we show how to efficiently find a provable near-optimal
solution to our maximization problem by exploiting a natural
diminishing returns property of our objective function.

Evaluating the influence. The influence function depends on the
probability of infection of every node in the network as follows:

$$\sigma(A; T) = \mathbb{E}\, N(A; T) = \sum_{n=1}^{N} P(t_n \le T \mid A), \qquad (2)$$

where $t_n$ is the infection time of node n, A is the set of
source nodes, and T is the time horizon or time window cut-off.
Therefore, we need to compute the probability of infection
$P(t_n \le T \mid A)$ for each node n in the network. Note that
whenever $n \in A$, the probability of infection
$P(t_n \le T \mid A)$ is trivially 1. We will refer to node n as
the sink node.
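As a worked instance of Eq. (2), consider a hypothetical three-node chain 1 → 2 → 3 (an illustrative example, not one of the networks studied in this paper) with source set A = {1} and exponential rates $\alpha_{1,2} = a$ and $\alpha_{2,3} = b$, $a \neq b$. The infection time of node 2 is exponential and that of node 3 is the sum of two independent exponential delays, so:

```latex
\begin{align*}
P(t_1 \le T \mid A) &= 1, \\
P(t_2 \le T \mid A) &= 1 - e^{-aT}, \\
P(t_3 \le T \mid A) &= 1 - \frac{b\,e^{-aT} - a\,e^{-bT}}{b - a}, \\
\sigma(A; T) &= 3 - e^{-aT} - \frac{b\,e^{-aT} - a\,e^{-bT}}{b - a}.
\end{align*}
```

For a = 1, b = 2 and T = 1 this gives $\sigma(A; T) \approx 2.03$, i.e., on average about two of the three nodes are infected within the time horizon.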

Revisiting the basic assumptions about a diffusion process that we
presented in Section 2, we recall some definitions to describe its
temporal evolution as in Kulkarni (1986). Given a diffusion
network G = (V, E), a set of nodes $B \subseteq V$, and a node
$n \in V$, we define the set of nodes blocked by or dominated by
B:

$$S_n(B) = \{u \in V : \text{any path from } u \text{ to } n \text{ in } G \text{ visits at least one node in } B\}.$$

By definition, $B \subseteq S_n(B)$ and $S_n(S_n(B)) = S_n(B)$. We
now define the set $\Omega^*_n$ as:

$$\Omega^*_n = \{X \subseteq V : X = S_n(X)\}.$$

In words, all nodes in $X \in \Omega^*_n$ block only themselves
relative to the sink node n. We can find all sets in $\Omega^*_n$
efficiently (Georgiadis et al., 2006; Provan & Shier, 1996). In
particular, we are able to find each $X \in \Omega^*_n$ in time
$O(|V|)$. However, in dense networks, $|\Omega^*_n|$ can be
exponentially large and lead to a worst-case nonpolynomial time
algorithm. In order to illustrate this, we compute
$\max_n |\Omega^*_n|$ across 1,000 random source sets with
|S| = 5 and |S| = 10 for several 256-node hierarchical networks of
increasing network density. We observe that
$\max_n |\Omega^*_n| < 85$ for all networks up to 2 edges per node
on average. However, $\max_n |\Omega^*_n|$ grows quickly for
higher network densities (e.g., $\max_n |\Omega^*_n| < 7750$ for a
network with 2.5 edges per node on average). In order to overcome
this drawback, we will propose several speed-ups (LTP and LSN)
that provide approximate solutions or sparsify the networks as in
Mathioudakis et al. (2011).
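The definitions of $S_n(B)$ and $\Omega^*_n$ can be made concrete with a small sketch. It computes $S_n(B)$ by a reverse search from the sink in the graph with B removed, and enumerates $\Omega^*_n$ by brute force over all subsets. The brute-force enumeration is an assumption made for readability and is only usable on tiny graphs; it is not the efficient procedure of the works cited above. All function names are hypothetical.

```python
from itertools import combinations

def dominated_set(nodes, edges, sink, blockers):
    """Return S_n(B): nodes whose every path to `sink` visits a blocker."""
    blockers = set(blockers)
    if sink in blockers:
        return frozenset(nodes)
    parents = {v: [] for v in nodes}
    for u, v in edges:
        parents[v].append(u)
    # Reverse search from the sink in the graph with the blockers removed.
    escaping, stack = {sink}, [sink]
    while stack:
        v = stack.pop()
        for u in parents[v]:
            if u not in blockers and u not in escaping:
                escaping.add(u)
                stack.append(u)
    return frozenset(nodes) - escaping

def closed_sets(nodes, edges, sink):
    """Enumerate {X : X = S_n(X)} by testing every subset (tiny graphs only)."""
    nodes = list(nodes)
    found = set()
    for size in range(len(nodes) + 1):
        for subset in combinations(nodes, size):
            if dominated_set(nodes, edges, sink, subset) == frozenset(subset):
                found.add(frozenset(subset))
    return found
```

Note that nodes with no path to the sink belong to every $S_n(B)$, since the defining condition holds vacuously for them.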



Given a diffusion process that starts in a set of source nodes A,
a sink node n and any time $t \ge 0$, we denote the set of
infected nodes as $I(t|A)$, the set of useless nodes as
$U_n(t|A)$, and the set of disabled nodes (i.e., infected or
useless) as $X_n(t|A)$. Useless nodes are nodes such that, if they
get infected and transmit the information to other nodes, this
information can only reach the sink node n through previously
infected nodes. Figures 1(a) and 1(b) illustrate the set of
infected nodes ($I$) and the set of useless nodes ($U_n$) for a
diffusion process in a small network at two different times. Note
that the set of disabled nodes ($X_n$) is composed of the sets of
infected ($I$) and useless nodes ($U_n$). By definition of
$S_n(\cdot)$, $U_n(t|A) = S_n(I(t|A)) \setminus I(t|A)$ and
$X_n(t|A) = S_n(I(t|A))$. Now, by studying the temporal evolution
of $X_n(t|A)$ we will be able to compute $P(t_n \le T \mid A)$.

Table 1. Influence $\sigma(A; T)$ that enumeration, InfluMax and
several other baselines achieve in a small Kronecker
core-periphery network with 35 nodes and 39 edges for different
time horizon values T and number of sources |A|. InfluMax always
achieves the optimal influence that exhaustive search gives but
several orders of magnitude faster.

First, for a diffusion process that starts in the set of source
nodes A, it can be shown that the set of disabled nodes
$X_n(t|A)$ at any time $t \ge 0$ belongs to $\Omega^*_n$.

Theorem 1 (Kulkarni, 1986). Given a set of source nodes A, a sink
node n and any time $t \ge 0$, $X_n(t|A) \in \Omega^*_n$.

Figure 1(c) enumerates all sets of disabled nodes
$X \in \Omega^*_n$ such that $A \subseteq X$ for the small network
depicted in Figures 1(a) and 1(b). They represent the states that
we need to describe the evolution of a diffusion process that
starts in the set of sources A relative to the sink node n. Now,
assuming independent pairwise exponential transmission likelihoods
in the diffusion network, the following theorem applies:

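Theorem 2 (Kulkarni, 1986). Given a set of source nodes A, a sink
node n and independent pairwise exponential transmission
likelihoods $f(t_j \mid t_i; \alpha_{i,j})$,
$\{X_n(t|A), t \ge 0\}$ is a continuous time Markov chain (CTMC)
with state space $\{X : X \in \Omega^*_n, A \subseteq X\}$ and
infinitesimal generator matrix $Q = [q(D, B)]$
($D, B \in \{X : X \in \Omega^*_n, A \subseteq X\}$) given by:

$$q(D, B) = \begin{cases}
\sum_{(u,v) \in C_v(D)} \alpha_{u,v} & \text{if } \exists v : B = S_n(D \cup \{v\}), \\
-\sum_{(u,v) \in C(D)} \alpha_{u,v} & \text{if } B = D, \\
0 & \text{otherwise,}
\end{cases}$$

where $C(D)$ is the unique minimal cut between D and
$\bar{D} = V \setminus D$ and $C_v(D) = \{(u, v) \in C(D)\}$.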
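The generator of Theorem 2 can be assembled by exploring the states reachable from $S_n(A)$. The following is a minimal sketch under stated assumptions, not the implementation used in this paper: edges are given as a hypothetical dict from (u, v) to a rate, the sink is assumed reachable from the sources, and states are ordered by size so that the absorbing state V comes last and the generator is upper triangular.

```python
from collections import deque

def dominated_set(nodes, rates, sink, blockers):
    """S_n(B): nodes whose every path to `sink` visits a node in `blockers`."""
    blockers = set(blockers)
    if sink in blockers:
        return frozenset(nodes)
    escaping, stack = {sink}, [sink]
    while stack:
        v = stack.pop()
        for (u, w), alpha in rates.items():
            if w == v and alpha > 0 and u not in blockers and u not in escaping:
                escaping.add(u)
                stack.append(u)
    return frozenset(nodes) - escaping

def build_ctmc(nodes, rates, sink, sources):
    """Return (states, Q); Q maps (row, col) to the nonzero generator entries."""
    start = dominated_set(nodes, rates, sink, sources)
    everything = frozenset(nodes)
    seen, found, queue, transitions = {start}, [start], deque([start]), {}
    while queue:
        D = queue.popleft()
        if D == everything:
            continue  # absorbing state: the sink is infected
        # Sum the rates of the cut edges (u, v), u in D, v not in D, per head v.
        rate_into = {}
        for (u, v), alpha in rates.items():
            if alpha > 0 and u in D and v not in D:
                rate_into[v] = rate_into.get(v, 0.0) + alpha
        for v, rate in rate_into.items():
            B = dominated_set(nodes, rates, sink, D | {v})
            transitions[(D, B)] = transitions.get((D, B), 0.0) + rate
            if B not in seen:
                seen.add(B)
                found.append(B)
                queue.append(B)
    # Every transition goes to a strict superset, so sorting by size is topological.
    states = sorted(found, key=len)
    pos = {X: i for i, X in enumerate(states)}
    Q = {}
    for (D, B), rate in transitions.items():
        Q[(pos[D], pos[B])] = rate
        Q[(pos[D], pos[D])] = Q.get((pos[D], pos[D]), 0.0) - rate
    return states, Q
```

On a four-node diamond network with a single source, this yields five states, in line with the kind of enumeration illustrated in Figure 1(c).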

Finally, let $t_n$ be the length of the fastest (shortest)
directed path from any of the nodes in A to the sink node n in the
directed acyclic graph (DAG) induced by the diffusion process on
network G. By construction of the CTMC $\{X_n(t|A), t \ge 0\}$ in
Theorem 2,

$$t_n = \min\{t \ge 0 : X_n(t|A) = S_N \mid X_n(0|A) = S_1\},$$

where $S_1$ and $S_N$ denote respectively the first and last state
of the CTMC. The length of the fastest (shortest) path is thus
equivalent to the time until the CTMC $\{X_n(t|A), t \ge 0\}$
becomes absorbed in the final state $S_N$ starting from state
$S_1$ (i.e., the state in which only the source nodes in A are
infected). Then, computing the probability of infection of the
sink node $P(t_n \le T \mid A)$ reduces to computing the
distribution of the time to absorption in the final state of the
CTMC. Such distributions are called continuous phase-type
distributions. Their generator matrix Q and the cumulative
distribution function satisfy (Gikhman & Skorokhod, 2004):

$$P(t_n \le T \mid A) = 1 - [1\ \mathbf{0}]'\, e^{ST}\, \mathbf{1}, \quad \text{where } Q = \begin{bmatrix} S & S^{0} \\ \mathbf{0}' & 0 \end{bmatrix},$$

where $e^{ST}$ denotes the matrix exponential, S is the submatrix
of Q that results from removing the column and row associated to
the last state $S_N$, and $S^{0} = -S\mathbf{1}$. By construction,
$\{X_n(t|A), t \ge 0\}$ has the structure of a DAG and it is
usually sparse. Then, S is upper triangular, sparse and can be
computed efficiently.
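The phase-type formula above can be evaluated numerically once S is known. The sketch below uses a small pure-Python matrix exponential (scaling and squaring with a truncated Taylor series) so that it is self-contained; this choice is an assumption made for illustration, and a numerical library routine would normally be preferable. The function names are hypothetical.

```python
import math

def mat_mul(A, B):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_exp(M, terms=30):
    """Matrix exponential by scaling and squaring with a Taylor series."""
    n = len(M)
    norm = max(sum(abs(x) for x in row) for row in M)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scale = 2.0 ** squarings  # the scaled matrix has norm at most 1/2
    A = [[x / scale for x in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        result = [[r + t for r, t in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def infection_probability(S, T):
    """1 - [1 0]' exp(S T) 1 for a transient sub-generator S."""
    E = mat_exp([[x * T for x in row] for row in S])
    return 1.0 - sum(E[0])  # first row: the chain starts in the first state
```

For a two-edge chain with rates 1 and 2, `infection_probability([[-1.0, 1.0], [0.0, -2.0]], 1.0)` agrees with the closed form of a sum of two independent exponential delays.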



As noted in Kulkarni (1986), this approach can be easily extended
to diffusion networks with phase-type transmission likelihoods,
which can approximate power-laws, Rayleigh or subprobability
distributions.

Figure 2. Panels plot influence $\sigma(A; T)$ (i.e., average
number of infected nodes) for T = 1 and transmission rates drawn
from $\alpha \sim U(0, 5)$ against number of sources. (a): 1,024
node Forest Fire network, (b): 512 node random Kronecker network,
(c): 1,024 node hierarchical Kronecker network. The proposed
algorithm InfluMax outperforms all other methods typically by at
least 20%.

Maximizing the influence. We have shown how to analytically
evaluate our objective function $\sigma(A; T)$ for any set of
sources A. However, optimizing $\sigma(A; T)$ with respect to the
set of sources A seems to be a cumbersome task and naive
brute-force search over all k-node sets is intractable even for
relatively small networks. Indeed, we cannot expect to find the
optimal solution to the continuous time influence maximization
problem defined by Eq. (1) since it is NP-hard:

Theorem 3. Given a network G = (V, E), a set of nodes
$A \subseteq V$ and a time horizon T, the continuous time
influence maximization problem defined by Eq. (1) is NP-hard.

Proof. If we let $T \to \infty$, the independent cascade model is
a particular case of our continuous time diffusion model. Then,
our problem is NP-hard by applying Theorem 2.4 in Kempe et al.
(2003). □



By construction, $\sigma(\emptyset; T) = 0$ and
$\sigma(A; T) \ge 0$. It also follows trivially that
$\sigma(A; T)$ is monotonically nondecreasing in the set of source
nodes A, i.e., $\sigma(A; T) \le \sigma(A'; T)$ whenever
$A \subseteq A'$. Fortunately, we now show that the objective
function $\sigma(A; T)$ is a submodular function in the set of
source nodes A. A set function $F : 2^{W} \to \mathbb{R}$ mapping
subsets of a finite set W to the real numbers is submodular if
whenever $A \subseteq B \subseteq W$ and $s \in W \setminus B$, it
holds that $F(A \cup \{s\}) - F(A) \ge F(B \cup \{s\}) - F(B)$,
i.e., adding s to the set A provides a marginal gain at least as
large as adding s to the set B. By this natural diminishing
returns property, we are able to find a provable near-optimal
solution to our problem:

Theorem 4. Given a network G = (V, E), a set of nodes
$A \subseteq V$ and a time horizon T, the influence function
$\sigma(A; T)$ is a submodular function in the set of nodes A.



Proof. We follow the proof of Theorem 2.2 in Kempe et al. (2003).
For simplicity, we assume that the infection time of all nodes in
A is t = 0; the results generalize trivially. Consider the
probability distribution of all possible time differences between
each pair of nodes in the network. Thus, given a sample
$\Delta t$ in the probability space, we define
$\sigma_{\Delta t}(A; T)$ as the total number of nodes infected in
a time less than or equal to T for $\Delta t$.

Define $R_{\Delta t}(k; T)$ as the set of nodes that can be
reached from node k in a time shorter than T. It follows trivially
that $\sigma_{\Delta t}(A; T) = |\cup_{k \in A} R_{\Delta t}(k; T)|$.
Define $R_{\Delta t}(k \mid N; T)$ as the set of nodes that can be
reached from node k in a time shorter than T and at the same time
cannot be reached in a time shorter than T from any node in the
set of nodes $N \subseteq V$. It follows that
$|R_{\Delta t}(k \mid N; T)| \ge |R_{\Delta t}(k \mid N'; T)|$ for
the sets of nodes $N \subseteq N'$.

Consider now the sets of nodes $A \subseteq A' \subseteq V$, and a
node a such that $a \notin A'$. Using the definition of
submodularity,

$$\begin{aligned}
\sigma_{\Delta t}(A \cup \{a\}; T) - \sigma_{\Delta t}(A; T) &= |R_{\Delta t}(a \mid A; T)| \\
&\ge |R_{\Delta t}(a \mid A'; T)| \\
&= \sigma_{\Delta t}(A' \cup \{a\}; T) - \sigma_{\Delta t}(A'; T),
\end{aligned}$$

and thus $\sigma_{\Delta t}(A; T)$ is submodular. Then, it follows
that $\sigma(A; T)$ is also submodular, since it is the
expectation of $\sigma_{\Delta t}(A; T)$ over $\Delta t$ and a
nonnegative combination of submodular functions is submodular. □

A well-known approximation algorithm to maximize monotonic
submodular functions is the greedy algorithm. It adds nodes to the
source node set A sequentially. In step k, it adds the node a
which maximizes the marginal gain
$\sigma(A_{k-1} \cup \{a\}; T) - \sigma(A_{k-1}; T)$. The greedy
algorithm finds a source node set which achieves at least a
constant fraction $(1 - 1/e)$ of the optimal (Nemhauser et al.,
1978).
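The greedy selection just described can be sketched as follows. The callable `influence` is a hypothetical stand-in for the influence function at a fixed time horizon; any monotone submodular set function can be plugged in. This is an illustrative sketch of plain greedy selection only, without the speed-ups discussed later.

```python
def greedy_sources(candidates, influence, k):
    """Pick up to k sources, adding the node with largest marginal gain each step.

    influence: callable mapping a frozenset of nodes to its influence.
    """
    selected = []
    current = influence(frozenset())
    for _ in range(k):
        best_node, best_gain = None, float("-inf")
        for a in candidates:
            if a in selected:
                continue
            gain = influence(frozenset(selected) | {a}) - current
            if gain > best_gain:
                best_node, best_gain = a, gain
        if best_node is None:
            break  # fewer than k candidates are available
        selected.append(best_node)
        current += best_gain
    return selected
```

Each step evaluates the objective once per remaining candidate, so selecting k sources among |V| nodes costs on the order of k|V| evaluations.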

Moreover, we can also use the submodularity of $\sigma(A; T)$ to
acquire a tight online bound on the solution quality obtained by
any algorithm:

Theorem 5 (Leskovec et al., 2007). For a source set
$\hat{A} \subseteq V$ with k sources and a node
$a \in V \setminus \hat{A}$, let
$\delta_a = \sigma(\hat{A} \cup \{a\}; T) - \sigma(\hat{A}; T)$
and $a_1, \ldots, a_k$ be the sequence of k nodes with $\delta_a$
in decreasing order. Then,
$\max_{|A| \le k} \sigma(A; T) \le \sigma(\hat{A}; T) + \sum_{i=1}^{k} \delta_{a_i}$.

Lazy evaluation (Leskovec et al., 2007) can be employed to speed
up the computation of the on-line bound for our algorithm, which
we will refer to as InfluMax.
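The bound of Theorem 5 is cheap to evaluate for any candidate solution. A minimal sketch, assuming a hypothetical callable `influence` that returns the influence of a set of nodes for a fixed time horizon:

```python
def online_bound(candidates, influence, selected, k):
    """Upper bound on the best influence of any k sources, given `selected`."""
    base = frozenset(selected)
    value = influence(base)
    # Marginal gains of all nodes outside the current solution, largest first.
    gains = sorted(
        (influence(base | {a}) - value for a in candidates if a not in base),
        reverse=True,
    )
    return value + sum(gains[:k])
```

The bound depends on the data: the closer the returned value is to the influence of `selected`, the closer that solution is guaranteed to be to the optimum.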

Figure 3. Influence $\sigma(A; T)$ achieved by InfluMax in
comparison with the online upper bound from Theorem 5 for T = 1.
(a) 35-node core-periphery Kronecker network, (b) 1,024 node
hierarchical Kronecker network, (c) 1,000 node real diffusion
network that we infer from hyperlinks cascades (T = 1).

Figure 4. Panels show (a) influence $\sigma(A; T)$ vs. time
horizon and (b) average computation time per source added for
InfluMax implemented with (i) lazy evaluation (LE), (ii) LE and
localized source nodes (LSN, m = 6), and (iii) LE and limited
transmission paths (LTP, m = 6) against number of nodes.

Speeding-up InfluMax. We can speed up our algorithm by
implementing the following speed-ups:

Lazy evaluation (LE, Leskovec et al. (2007)): it dramatically
reduces the number of evaluations of marginal gains by exploiting
the submodularity of $\sigma(A; T)$.

Localized source nodes (LSN): for each node n, we speed up the
computation of $P(t_n \le T \mid A)$ by ignoring any $a \in A$
whose shortest path to n traverses more than m nodes.

Limited transmission paths (LTP): for each node n, we speed up the
computation of $P(t_n \le T \mid A)$ by ignoring any path from
$a \in A$ to n that traverses more than m nodes.

LSN and LTP should be used with care since they provide an
approximate $P(t_n \le T \mid A)$. In the remainder of this
article, if not specified, we run InfluMax with LE but avoid using
LSN and LTP.
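Lazy evaluation can be sketched with a priority queue of possibly stale marginal gains: by submodularity, a stale gain is an upper bound on the current gain, so a node whose freshly recomputed gain stays on top of the queue can be selected without re-evaluating the others. The code below is an illustrative sketch in that spirit, with a hypothetical `influence` callable; it also counts objective evaluations to make the saving visible.

```python
import heapq

def lazy_greedy(candidates, influence, k):
    """Greedy selection with lazy re-evaluation of marginal gains.

    Returns (chosen nodes, number of marginal gain evaluations).
    """
    selected = frozenset()
    current = influence(selected)
    evaluations = 0
    # Heap entries: (-gain, tie breaker, node, round in which gain was computed).
    heap = []
    for order, a in enumerate(candidates):
        gain = influence(frozenset({a})) - current
        evaluations += 1
        heap.append((-gain, order, a, 0))
    heapq.heapify(heap)
    chosen = []
    while heap and len(chosen) < k:
        neg_gain, order, a, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # The gain is up to date and no stale entry can exceed it.
            chosen.append(a)
            selected = selected | {a}
            current -= neg_gain
        else:
            gain = influence(selected | {a}) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, order, a, len(chosen)))
    return chosen, evaluations
```

On a toy coverage objective with four candidates and k = 2, this uses 5 evaluations where plain greedy selection would use 7.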

4. Experimental evaluation 

We evaluate our algorithm InfluMax on (i) synthetic networks that
mimic the structure of real networks and on (ii) real networks
inferred from the MemeTracker dataset¹ by using NetRate's public
implementation (Gomez-Rodriguez et al., 2011). We show that
InfluMax outperforms three state of the art algorithms: the
traditional greedy algorithm (Kempe et al., 2003), PMIA (Chen et
al., 2010) and SP1M (Chen et al., 2009).



4.1. Experiments on synthetic data 

Experimental setup. We perform experiments on two types of
synthetic networks that mimic the structure of directed social
networks: Kronecker (Leskovec et al., 2010) and Forest Fire (scale
free) (Barabási & Albert, 1999) networks. We consider three types
of Kronecker networks with very different structure: random (Erdős
& Rényi, 1960) (parameter matrix [0.5, 0.5; 0.5, 0.5]),
hierarchical (Clauset et al., 2008) ([0.9, 0.1; 0.1, 0.9]) and
core-periphery (Leskovec et al., 2010) ([0.9, 0.5; 0.5, 0.3]).



First, we generate a network G using one of the network models
cited above. Then, we draw a transmission rate for each edge
$(j, i) \in G$ from a uniform distribution. We can control the
transmission rate variance across edges in the network by tuning
the parameter values of the distribution. In social networks,
transmission rates model how fast information spreads across the
network. Given G and the transmission rates $\alpha_{j,i}$, our
aim is to find the most influential subset of k nodes, i.e., the
subset of nodes that maximizes the spread of information up to a
time T. In the traditional greedy algorithm, PMIA and SP1M, we
ignore the transmission rates and consider all network edges to be
active with probability 1, i.e., we do not consider the temporal
dynamics. We did not need to use Monte Carlo simulations in the
traditional greedy algorithm since we assume all edges to be
always active.

Solution quality. First, we compare InfluMax to exhaustive search
and several state of the art algorithms on a small network. By
studying a small network in which exhaustive search can be run, we
are able to estimate exactly how far InfluMax is from the optimum,
which is NP-hard to find. We then compare InfluMax to the state of
the art on different large networks. Running exhaustive search on
large networks is computationally too expensive and we compute
instead the tight on-line bound from Theorem 5.

We compare InfluMax to several state of the art methods on a small
core-periphery Kronecker network with 35 nodes and 39 edges and
transmission rates drawn from a uniform distribution
$\alpha \sim U(0, 10)$. We summarize the results in Table 1. In
addition to InfluMax and three state of the art methods, we also
run a baseline that simply chooses the set of sources randomly.
For all methods, we compute the influence they achieve by
evaluating Eq. (2) for the set of sources selected by them.
Surprisingly, InfluMax achieves in most cases the optimal
influence that exhaustive search gives but several orders of
magnitude faster. In other words, the solution given by InfluMax
may be in practice much closer to the optimum, which is NP-hard to
find, than $(1 - 1/e)$, the theoretical guarantee given by
Nemhauser et al. (1978), and it outperforms other methods by 20%.

¹ Data available at http://memetracker.org

Now, we focus on different large synthetic networks. Figure 2
shows the average total number of infected nodes against number of
sources that InfluMax achieves in comparison with the other
methods on a 512 node random Kronecker network, a 1,024 node
hierarchical Kronecker network and a 1,024 node Forest Fire (scale
free) network. All three networks have approximately 2 edges on
average per node. We set the time horizon to T = 1.0 and the
transmission rates are drawn from a uniform distribution
$\alpha \sim U(0, 5)$. InfluMax typically outperforms other
methods by at least 20% by exploiting the temporal dynamics of the
network. We also compare InfluMax with the on-line bound from
Theorem 5. Fig. 3 shows the average number of infected nodes
against number of sources that InfluMax achieves in comparison
with the on-line bound for the small core-periphery Kronecker
network and the large hierarchical Kronecker network that we used
previously. If we pay attention to the value of the bound on the
small network for source set sizes significantly smaller than the
number of nodes in the network, we observe that the bound value on
the influence is not as close to the optimal value given by
exhaustive search as we could expect. That means that although the
bound is not very tight on the large network, we may be actually
achieving in practice an almost optimal value on that network too.

Figure 5. Influence $\sigma(A; T)$ for time horizon T = 1 against
number of sources for (a) a 1,000 node real diffusion network that
we infer from hyperlink cascades and (b) a 1,000 node real
diffusion network that we infer from MemeTracker cascades. The
proposed algorithm InfluMax outperforms all other methods by
20-25%.

Influence vs. time horizon. Intuitively, the smaller the 
time horizon, the more important the temporal dynamics 
become when choosing the subset of most influential nodes 
of a given size. Fig. 4(a) shows the average total number
of infected nodes against time horizon for a hierarchical
Kronecker network with 1,024 nodes and approx. 2 edges
per node. We consider a source set of cardinality |A| = 10
and we draw the transmission rate of each edge from a uniform
distribution $\alpha \sim U(0, 5)$. The experimental results
for all transmission rates configurations confirm the initial 
intuition, i.e., the difference between InfluMax and other 
methods is greater for small time horizons. 



Running time. Fig. 4(b) shows the average computation time per
source added of our algorithm implemented (i) with lazy
evaluation, (ii) with lazy evaluation and localized source nodes
with m = 6 hops and (iii) with lazy evaluation and limited
transmission paths with m = 6 hops on a single CPU (2.3 GHz dual
core with 4 GB RAM). We use hierarchical Kronecker networks with
an increasing number of nodes but approximately the same network
density since real networks are usually sparse. Remarkably, the
number of hops that we use in localized source nodes and limited
transmission paths results in an approximation error for the
influence $\sigma(A; T)$ of at most 10%, while achieving a
speed-up of ~5x for the largest network (2,048 nodes).

4.2. Experiments on real data 

Experimental setup. We used the publicly available MemeTracker
dataset, which contains more than 172 million news articles and
blog posts (Leskovec et al., 2009). We trace the information in
two different ways and then infer two different diffusion networks
using NetRate (Gomez-Rodriguez et al., 2011).



First, we find more than 100,000 hyperlink cascades in the
MemeTracker dataset. Each hyperlink cascade consists of
a collection of time-stamped hyperlinks between sites (in
blog posts) that refer to closely related pieces of
information. From the hyperlink cascade data, we infer an
underlying diffusion network with the top (in terms of
hyperlinks) 1,000 media sites and blogs. Second, we apply the
MemeTracker methodology (Leskovec et al., 2009) to find
343 million short textual phrases. We cluster the phrases
to aggregate different textual variants of the same phrase
and consider the 12,000 largest clusters. Each phrase
cluster is a MemeTracker cascade. Each cascade consists of a
collection of time-stamps when sites (in blog posts) first
mentioned any phrase in the cluster. From the MemeTracker
cascades, we infer an underlying diffusion network
with the top (in terms of phrases) 1,000 media sites and
blogs. We then further sparsify both networks by keeping
the 1,000 fastest edges, since it has been shown that, in the
context of influence maximization, computations on
sparsified models give up little accuracy but improve
scalability (Mathioudakis et al., 2011).
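
The final sparsification step can be sketched as follows. This assumes the inferred network is available as a dictionary from directed edges to transmission rates, and that "fastest" means largest rate, i.e., shortest expected transmission time under an exponential transmission model. Both the data representation and the name `keep_fastest_edges` are illustrative assumptions rather than part of NetRate or the paper.

```python
import heapq

def keep_fastest_edges(rates, k):
    """Return the k edges with the largest transmission rates.

    `rates` maps (source, target) pairs to non-negative rates; edges with
    rate 0 are treated as absent. Assumption: a larger rate means a
    shorter expected delay, so the largest rates are the "fastest" edges.
    """
    present = ((edge, a) for edge, a in rates.items() if a > 0)
    top = heapq.nlargest(k, present, key=lambda item: item[1])
    return dict(top)

# Hypothetical toy network; the paper keeps k = 1,000 edges.
rates = {("a", "b"): 2.5, ("b", "c"): 0.1, ("a", "c"): 1.2,
         ("c", "d"): 0.0, ("d", "a"): 4.0}
sparse = keep_fastest_edges(rates, k=3)
```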



Solution quality. Fig. 5 shows the average total number
of infected nodes against the number of sources that
InfluMax achieves in comparison with other methods for both
real networks, which were inferred from the hyperlink
cascade and the MemeTracker cascade datasets, as described
above. We set the time horizon to T = 1.0. Again, InfluMax
outperforms all other methods, typically by ~30%,
by considering the temporal dynamics of the diffusion.
Finally, we also compare InfluMax with the on-line bound
from Th. 5 for the real network that we inferred from the
hyperlink cascade dataset in Fig. 3(c). Similarly to the
synthetic networks, the bound is not as tight as expected.

5. Conclusions

We have developed a method for influence maximization,
InfluMax, that accounts for the temporal dynamics
underlying diffusion processes. The method allows for
variable transmission (influence) rates between nodes of a
network, as found in real-world scenarios. Perhaps
surprisingly, for the rather general case of continuous
temporal dynamics with variable transmission rates, we can
evaluate the influence of any set of source nodes in a network
analytically using the work of Kulkarni (1986). In this
analytical framework, we find the near-optimal set of nodes
that maximizes influence by exploiting the submodularity of
our objective function. In addition, reevaluating the influence
after changes to the network is straightforward and the
algorithm parallelizes naturally by sink and source nodes.
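
As a sanity check for the quantity being maximized, the influence σ(A; T) can also be estimated by simulation. The sketch below is not the analytical evaluation based on Kulkarni (1986); it swaps in plain Monte Carlo sampling. It assumes that each edge's transmission time is exponentially distributed with the edge's rate and that a node is infected at the length of the shortest path from the source set. The graph encoding and the function names are illustrative assumptions.

```python
import heapq
import random

def simulate_infected(rates, sources, horizon, rng):
    """Count nodes infected by time `horizon` in one sampled cascade."""
    adjacency = {}
    for (u, v), alpha in rates.items():
        if alpha > 0:
            # Sample this edge's transmission delay ~ Exp(alpha).
            adjacency.setdefault(u, []).append((v, rng.expovariate(alpha)))
    times = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    done = set()
    # Dijkstra from all sources, pruned at the time horizon.
    while heap:
        t, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, delay in adjacency.get(u, []):
            arrival = t + delay
            if arrival <= horizon and arrival < times.get(v, float("inf")):
                times[v] = arrival
                heapq.heappush(heap, (arrival, v))
    return len(done)

def estimate_influence(rates, sources, horizon, samples=5000, seed=0):
    """Average number of infected nodes over `samples` sampled cascades."""
    rng = random.Random(seed)
    total = sum(simulate_infected(rates, sources, horizon, rng)
                for _ in range(samples))
    return total / samples

# Single edge a -> b with rate 1 and T = 1: the exact value is
# 1 + (1 - exp(-1)), roughly 1.632; the estimate should be close to it.
estimate = estimate_influence({("a", "b"): 1.0}, ["a"], horizon=1.0)
```

Such an estimator is only useful for cross-checking small instances; it needs many samples per candidate source set, which is the cost that an analytical evaluation avoids.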

We evaluated our algorithm on a wide range of synthetic
diffusion networks with heterogeneous temporal dynamics,
which aim to mimic the structure of real-world social and
information networks. Our algorithm is remarkably stable
across different network topologies. It outperforms
state-of-the-art methods in terms of influence (i.e., average
number of infected nodes) for different network topologies,
time horizons and source set sizes. InfluMax typically gives
an influence gain of ~25% and it achieves the greatest
improvement for small time horizons; in such scenarios, the
temporal dynamics play a dramatic role. We also evaluated
InfluMax on two real diffusion networks that we inferred
from the MemeTracker dataset using NetRate. Again, it
outperformed the state of the art by ~30%.

We believe that InfluMax provides a novel view of the 
influence maximization problem by accounting for the un- 
derlying temporal dynamics of diffusion networks. 

References 

Barabási, A.-L. and Albert, R. Emergence of scaling in
random networks. Science, 286:509-512, 1999. 

Bharathi, S., Kempe, D., and Salek, M. Competitive in- 
fluence maximization in social networks. Internet and 
Network Economics, pp. 306-311, 2007.

Chen, W., Wang, Y., and Yang, S. Efficient influence max- 
imization in social networks. In KDD, 2009. 

Chen, W., Wang, C., and Wang, Y. Scalable influence
maximization for prevalent viral marketing in large-scale
social networks. In KDD, 2010.

Clauset, A., Moore, C., and Newman, M. E. J. Hierarchical
structure and the prediction of missing links in networks. 
Nature, 453(7191):98-101, 2008. 

Erdős, P. and Rényi, A. On the evolution of random graphs.
Publication of the Mathematical Institute of the Hungarian
Academy of Science, 5:17-67, 1960.

Georgiadis, L., Werneck, R.F., Tarjan, R.E., et al. Finding 
dominators in practice. Journal of Graph Algorithms and 
Applications, 10(1):69-94, 2006.

Gikhman, I.I. and Skorokhod, A. V. The theory of stochastic 
processes, volume 2. Springer Verlag, 2004. 

Gomez-Rodriguez, M. and Schölkopf, B. Submodular
Inference of Diffusion Networks from Multiple Trees. In
ICML, 2012.

Gomez-Rodriguez, M., Leskovec, J., and Krause, A. In- 
ferring Networks of Diffusion and Influence. In KDD, 
2010. 

Gomez-Rodriguez, M., Balduzzi, D., and Schölkopf, B.
Uncovering the Temporal Dynamics of Diffusion Net- 
works. In ICML, 2011. 

Goyal, A., Bonchi, F., Lakshmanan, L.V.S., et al.
Approximation Analysis of Influence Spread in Social Networks.
Arxiv preprint arXiv:1008.2005, 2010.

Kempe, D., Kleinberg, J. M., and Tardos, E. Maximiz- 
ing the spread of influence through a social network. In 
KDD, 2003. 

Kulkarni, V. G. Shortest paths in networks with
exponentially distributed arc lengths. Networks, 16(3):255-274,
1986.

Leskovec, J., Krause, A., Guestrin, C., et al. Cost-effective
outbreak detection in networks. In KDD, 2007. 

Leskovec, J., Backstrom, L., and Kleinberg, J. Meme- 
tracking and the dynamics of the news cycle. In KDD, 
2009. 

Leskovec, J., Chakrabarti, D., Kleinberg, J., et al. Kro- 
necker graphs: An approach to modeling networks. 
JMLR, 11:985-1042, 2010. 

Mathioudakis, M., Bonchi, F., Castillo, C., et al.
Sparsification of influence networks. In KDD, 2011.

Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. An
analysis of approximations for maximizing submodular set
functions. Mathematical Programming, 14(1), 1978.

Provan, J.S. and Shier, D.R. A paradigm for listing (s, t)-
cuts in graphs. Algorithmica, 15(4):351-372, 1996.

Richardson, M. and Domingos, P. Mining knowledge- 
sharing sites for viral marketing. In KDD, 2002. 

Wallinga, J. and Teunis, P. Different epidemic curves for 
severe acute respiratory syndrome reveal similar impacts 
of control measures. American Journal of Epidemiology, 
160(6):509-516, 2004.



